Machine learning (ML) models, such as SVMs, require a definition of distance/similarity between pairs of sequences for tasks like classification and clustering of sequences. Several methods have been proposed to compute similarity between sequences, such as exact methods, which count the number of matches between $k$-mers (subsequences of length $k$), and approximate methods, which estimate the pairwise similarity score. Although exact methods yield better classification performance, they are computationally expensive, which limits their applicability to a small number of sequences. Approximate algorithms have proven to be more scalable and achieve comparable (and sometimes better) performance than the exact methods -- they are designed in a "general" way to deal with different types of sequences (e.g., music, protein, etc.). Although general applicability is a desirable property of an algorithm, it is not the case in all scenarios. For example, in the current COVID-19 (coronavirus) pandemic, there is a need for methods that can deal specifically with the coronavirus. To this end, we propose a series of ways to improve the performance of the approximate kernel (using minimizers and information gain) in order to enhance its predictive performance on coronavirus sequences. More specifically, we improve the quality of the approximate kernel using domain knowledge (computed via information gain) and efficient preprocessing (computed via minimizers) to classify coronavirus spike protein sequences corresponding to different variants (e.g., Alpha, Beta, Gamma). We report results using different classification and clustering algorithms and assess their performance with multiple evaluation metrics. Using two datasets, we show that our proposed method helps improve the performance of the kernel compared to baseline and state-of-the-art methods in the healthcare domain.
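To make the contrast between exact $k$-mer matching and minimizer-based preprocessing concrete, here is a minimal sketch (not the authors' implementation; the function names and the toy spike fragment are illustrative) of extracting $k$-mers from a protein sequence and reducing each one to its lexicographically smallest $m$-mer, its minimizer, before any kernel computation:

```python
# Minimal sketch: k-mer extraction and lexicographic m-minimizers for a protein sequence.

def kmers(seq, k=3):
    """Return all overlapping subsequences of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def minimizer(kmer, m=2):
    """Return the lexicographically smallest m-mer inside a k-mer."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))

spike_fragment = "MFVFLVLLPLVSSQCVNLT"   # illustrative prefix of a spike sequence
ks = kmers(spike_fragment, k=3)
mins = [minimizer(km, m=2) for km in ks]
print(ks[:5], mins[:5])
```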
The rapid spread of the COVID-19 pandemic has resulted in an enormous amount of sequence data of the SARS-CoV-2 genome -- millions of sequences and counting. While this magnitude of data is beyond the capability of traditional methods for understanding the diversity, dynamics, and evolution of the virus, it is a rich resource for machine learning (ML) methods as an alternative for extracting such important information from these data. It is therefore crucial to design a framework for testing and benchmarking the robustness of these ML models. This paper makes the first effort (to our knowledge) to benchmark the robustness of ML models by simulating biological sequences with errors. In this paper, we introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show from experiments on a wide array of ML models that some simulation-based approaches are more robust (and accurate) than others, for specific embedding methods, to certain adversarial attacks on the input sequences. Our benchmarking framework may help researchers properly assess different ML models and help them understand the behavior of the SARS-CoV-2 virus or avoid possible future pandemics.
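As an illustration of what such perturbations might look like, the sketch below applies random substitutions, insertions, and deletions to a genome fragment; the error rates and the split between substitution-heavy (Illumina-like) and indel-heavy (PacBio-like) profiles are assumptions for the example, not the paper's calibrated error models:

```python
# Minimal sketch: perturb a nucleotide sequence with random substitutions and indels.
import random

def perturb(seq, sub_rate=0.001, ins_rate=0.0005, del_rate=0.0005, alphabet="ACGT"):
    out = []
    for base in seq:
        r = random.random()
        if r < del_rate:
            continue                             # deletion: drop the base
        if r < del_rate + ins_rate:
            out.append(random.choice(alphabet))  # insertion before the base
        if random.random() < sub_rate:
            base = random.choice([b for b in alphabet if b != base])  # substitution
        out.append(base)
    return "".join(out)

genome = "ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTC"  # illustrative fragment
noisy = perturb(genome, sub_rate=0.01)
print(noisy)
```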
The origin of the COVID-19 pandemic, still unknown, is an important open question. There are speculations that bats are a possible origin. Likewise, there are many closely related (corona)viruses, such as SARS, which was found to be transmitted through civets. The study of the different hosts which may be potential carriers and transmitters of deadly viruses is crucial to understanding, mitigating, and preventing current and future pandemics. In coronaviruses, the surface (S) protein, or spike protein, is an important part of determining host specificity, since it is the point of contact between the virus and the host cell membrane. In this paper, we classify the spike protein sequences of over five thousand coronaviruses, separating them into clusters of distinct hosts, among them birds, bats, camels, swine, humans, and weasels, to name a few. We propose a feature embedding based on the well-known position weight matrix (PWM), which we call PWM2Vec, and use it to generate feature vectors from the spike protein sequences of these coronaviruses. While our embedding is inspired by the success of PWMs in biological applications, such as determining protein function or identifying transcription factor binding sites, we are the first (to our knowledge) to use PWMs to generate fixed-length feature vector representations in the context of host classification from viral sequences. Results on real-world data show that, when using PWM2Vec, we are able to perform comparably well to the baseline models. We also measure the importance of different amino acids using information gain to show which amino acids are important for predicting the host of a given coronavirus.
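The following sketch illustrates the general idea of a PWM-based fixed-length embedding; it is an assumed simplification rather than the exact PWM2Vec construction: count amino-acid occurrences at each of the $k$ positions over a sequence's $k$-mers, normalize them into a position weight matrix, and flatten that matrix into a feature vector.

```python
# Minimal sketch: PWM-style fixed-length embedding of a protein sequence.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def pwm_embedding(seq, k=3):
    counts = np.zeros((k, len(AMINO_ACIDS)))
    for i in range(len(seq) - k + 1):
        for pos, aa in enumerate(seq[i:i + k]):
            if aa in AA_INDEX:                   # skip ambiguous residues
                counts[pos, AA_INDEX[aa]] += 1
    freqs = counts / counts.sum(axis=1, keepdims=True)
    return freqs.flatten()                       # fixed length: k * 20

vec = pwm_embedding("MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNS", k=3)
print(vec.shape)                                 # (60,)
```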
With the rapid global spread of COVID-19, more and more data related to this virus is becoming available, including genomic sequence data. The total number of genomic sequences that are publicly available on platforms such as GISAID is currently in the millions, and increases every day. The availability of such Big Data creates a new opportunity for researchers to study this virus in detail. This is particularly important with all of the dynamics of the COVID-19 variants which emerge and circulate. This rich data source will give us insight into the best ways to respond to this and future pandemic threats, with the ultimate goal of mitigating or eliminating such threats. Analyzing and processing millions of genomic sequences is a challenging task. Although traditional methods for sequence classification have proven to be effective, they are not designed to deal with these specific types of genomic sequences. Moreover, most existing methods also face scalability issues. Previous studies tailored to coronavirus genomic data propose using spike sequences (corresponding to a subsequence of the genome), rather than the complete genomic sequences, to perform different machine learning (ML) tasks such as classification and clustering. However, those methods suffer from scalability issues. In this paper, we propose an approach called Spike2Vec, an efficient and scalable feature vector representation for each spike sequence that can be used for downstream ML tasks. Through experiments, we show that Spike2Vec is not only scalable to millions of spike sequences, but also outperforms the baseline models in terms of prediction accuracy, F1 score, etc.
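A Spike2Vec-style fixed-length representation can be pictured as a frequency vector over all possible $k$-mers of the amino-acid alphabet; the sketch below is a hedged approximation of that idea (the authors' exact encoding may differ), with $k=2$ to keep the vector small:

```python
# Minimal sketch: fixed-length k-mer frequency vector for a spike protein sequence.
from itertools import product
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_frequency_vector(seq, k=2):
    index = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=k))}
    vec = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:                 # ignore k-mers with ambiguous residues
            vec[index[kmer]] += 1
    return vec / max(vec.sum(), 1)        # normalize counts to frequencies

vec = kmer_frequency_vector("MFVFLVLLPLVSSQCVNLT", k=2)
print(vec.shape)                          # (400,) = 20^2 bins
```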
Three main points: 1. Data Science (DS) will be increasingly important to heliophysics; 2. Methods of heliophysics science discovery will continually evolve, requiring the use of learning technologies [e.g., machine learning (ML)] that are applied rigorously and that are capable of supporting discovery; and 3. To grow with the pace of data, technology, and workforce changes, heliophysics requires a new approach to the representation of knowledge.
Large language models have ushered in a golden age of semantic parsing. The seq2seq paradigm allows for open-schema and abstractive attribute and relation extraction given only small amounts of finetuning data. Language model pretraining has simultaneously enabled great strides in natural language inference, reasoning about entailment and implication in free text. These advances motivate us to construct ImPaKT, a dataset for open-schema information extraction, consisting of around 2500 text snippets from the C4 corpus, in the shopping domain (product buying guides), professionally annotated with extracted attributes, types, attribute summaries (attribute schema discovery from idiosyncratic text), many-to-one relations between compound and atomic attributes, and implication relations. We release this data in the hope that it will be useful for fine-tuning semantic parsers for information extraction and knowledge base construction across a variety of domains. We evaluate the power of this approach by fine-tuning the open-source UL2 language model on a subset of the dataset, extracting a set of implication relations from a corpus of product buying guides, and conducting human evaluations of the resulting predictions.
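One way to picture the fine-tuning setup is to serialize each annotated snippet into an (input, target) text pair for a seq2seq model such as UL2; the field names and the linearized target format in the sketch below are illustrative assumptions, not the ImPaKT schema:

```python
# Minimal sketch: serialize an annotated snippet into a seq2seq (input, target) pair.
# The prompt prefix, separators, and field names are hypothetical.

def to_seq2seq_example(snippet, attribute, implication):
    source = f"extract implication: {snippet}"
    target = f"attribute: {attribute} | implies: {implication}"
    return {"input_text": source, "target_text": target}

example = to_seq2seq_example(
    snippet="A waterproof rating of IPX7 means the speaker survives brief submersion.",
    attribute="waterproof rating = IPX7",
    implication="suitable for poolside use",
)
print(example["input_text"])
print(example["target_text"])
```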
Thanks to rapid progress in artificial intelligence, we have entered an era when technology and philosophy intersect in interesting ways. Sitting squarely at the centre of this intersection are large language models (LLMs). The more adept LLMs become at mimicking human language, the more vulnerable we become to anthropomorphism, to seeing the systems in which they are embedded as more human-like than they really are. This trend is amplified by the natural tendency to use philosophically loaded terms, such as "knows", "believes", and "thinks", when describing these systems. To mitigate this trend, this paper advocates the practice of repeatedly stepping back to remind ourselves of how LLMs, and the systems of which they form a part, actually work. The hope is that increased scientific precision will encourage more philosophical nuance in the discourse around artificial intelligence, both within the field and in the public sphere.
Wildfires are a common problem in many areas of the world with often catastrophic consequences. A number of systems have been created to provide early warnings of wildfires, including those that use satellite data to detect fires. The increased availability of small satellites, such as CubeSats, allows the wildfire detection response time to be reduced by deploying constellations of multiple satellites over regions of interest. By using machine learned components on-board the satellites, constraints which limit the amount of data that can be processed and sent back to ground stations can be overcome. There are hazards associated with wildfire alert systems, such as failing to detect the presence of a wildfire, or detecting a wildfire in the incorrect location. It is therefore necessary to be able to create a safety assurance case for the wildfire alert ML component that demonstrates it is sufficiently safe for use. This paper describes in detail how a safety assurance case for an ML wildfire alert system is created. This represents the first fully developed safety case for an ML component containing explicit argument and evidence as to the safety of the machine learning.
Deep learning semantic segmentation algorithms have provided improved frameworks for the automated production of Land-Use and Land-Cover (LULC) maps, significantly increasing both the frequency of map generation and the consistency of production quality. In this research, a total of 28 different model variations were examined to improve the accuracy of LULC maps. The experiments were carried out using Landsat 5/7 or Landsat 8 satellite images with North American Land Change Monitoring System (NALCMS) labels. The performance of various CNNs and extension combinations was assessed, where VGGNet with an output stride of 4 and a modified U-Net architecture provided the best results. An additional expanded analysis of the generated LULC maps is also provided. Using a deep neural network, this work achieved 92.4% accuracy for 13 LULC classes within southern Manitoba, representing a 15.8% improvement over published results for the NALCMS. Based on the large regions of interest, the higher radiometric resolution of Landsat 8 data resulted in better overall accuracy (88.04%) compared to Landsat 5/7 (80.66%) for 16 LULC classes. This represents an 11.44% and 4.06% increase in overall accuracy compared to previously published NALCMS results, while incorporating a larger land area and a higher number of LULC classes into the models than other published LULC map automation methods.
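For readers unfamiliar with this family of architectures, the sketch below is a deliberately tiny PyTorch encoder-decoder with a U-Net-style skip connection that predicts one of 13 LULC classes per pixel from a multi-band Landsat patch; the band count, channel widths, and depth are assumptions and do not reproduce the paper's VGGNet/U-Net configuration:

```python
# Minimal sketch: tiny encoder-decoder segmentation network with a U-Net-style skip.
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, in_bands=6, num_classes=13):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_bands, 32, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, num_classes, 1))

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        d = self.up(e2)
        d = torch.cat([d, e1], dim=1)            # U-Net style skip connection
        return self.dec(d)                       # per-pixel class logits

logits = TinySegNet()(torch.randn(1, 6, 128, 128))
print(logits.shape)                              # torch.Size([1, 13, 128, 128])
```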
Detecting obstacles is crucial for safe and efficient autonomous driving. To this end, we present NVRadarNet, a deep neural network (DNN) that detects dynamic obstacles and drivable free space using automotive RADAR sensors. The network utilizes temporally accumulated data from multiple RADAR sensors to detect dynamic obstacles and computes their orientation in a top-down bird's-eye view (BEV). The network also regresses drivable free space to detect unclassified obstacles. Our DNN is the first of its kind to utilize sparse RADAR signals in order to perform obstacle and free space detection in real time from RADAR data. The network has been successfully used for perception on our autonomous vehicles in real self-driving scenarios. The network runs faster than real time on an embedded GPU and shows good generalization across geographic regions.
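As a rough illustration of the kind of input such a network consumes, the sketch below accumulates sparse radar returns from several sweeps into a top-down BEV grid; the grid size, cell resolution, and the max-RCS aggregation rule are assumptions, not NVRadarNet's actual preprocessing:

```python
# Minimal sketch: accumulate sparse radar detections into a bird's-eye-view grid.
import numpy as np

def radar_to_bev(points, grid_size=128, cell_m=0.5):
    """points: array of (x, y, rcs) radar returns in ego coordinates (meters)."""
    bev = np.zeros((grid_size, grid_size))
    half = grid_size * cell_m / 2.0
    for x, y, rcs in points:
        col = int((x + half) / cell_m)
        row = int((y + half) / cell_m)
        if 0 <= row < grid_size and 0 <= col < grid_size:
            bev[row, col] = max(bev[row, col], rcs)   # keep strongest return per cell
    return bev

sweeps = [np.array([[5.0, 1.0, 12.0], [-3.5, 8.2, 7.5]]),   # two illustrative sweeps
          np.array([[5.2, 1.1, 11.0]])]
bev = sum(radar_to_bev(s) for s in sweeps)                   # temporal accumulation
print(bev.shape, bev.max())
```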